Dealing with Input Noise in Statistical Machine Translation

Authors

  • Lluís Formiga
  • José A. R. Fonollosa
Abstract

Misspelled words directly affect the final quality obtained by Statistical Machine Translation (SMT) systems, as the input becomes noisy and unpredictable. This paper presents several strategies for improving the translation of real-life noisy input. The proposed strategies are based on a preprocessing step consisting of a character-based translator (MT) from noisy into cleaned text. The use of a character-level translator allows us to provide various spelling alternatives in a lattice format to the final bilingual translator, so the final MT system is the one that decides the best path to translate. The different hypotheses are obtained under the assumption of a noisy channel model for this task. This paper reports experiments with real-life noisy input and a standard phrase-based SMT system from English into Spanish.

Title and abstract in another language (Spanish)

Estudio de estrategias para tratar los errores ortográficos en la entrada de los sistemas de traducción automática estadística

Las palabras con errores ortográficos tienen un impacto directo en la calidad final obtenida por los sistemas de traducción automática estadística (TA) debido a que la entrada se vuelve ruidosa e impredecible. Este artículo presenta algunas estrategias de mejora a la hora de traducir textos de entrada con ruido del mundo real. Estas estrategias consisten en la adición de un paso de preproceso basado en un traductor a nivel de carácter de texto ruidoso a texto limpio. El uso de un traductor a nivel de carácter permite proporcionar diversas alternativas de ortografía en un formato de lattice como entrada del traductor bilingüe final. Por lo tanto, es el traductor final quien decide la mejor secuencia de palabras a traducir. Para esta tarea, las diferentes hipótesis se obtienen suponiendo un modelo de distorsión del canal. En este trabajo presentamos los experimentos realizados con textos reales de entrada ruidosa y un sistema estándar de traducción automática estadística de inglés a español.
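The idea of passing several spelling alternatives to the final translator can be illustrated with a minimal sketch. It is not the authors' system: the paper uses a character-based translator to propose alternatives, whereas this sketch swaps in a plain edit-distance channel score and a toy unigram prior. The vocabulary, the probabilities, the `per_edit_penalty` value and all function names are hypothetical, and the "lattice" is simplified to one list of scored alternatives per input token.

```python
import math

# Hypothetical toy prior P(clean); a real system would estimate this from data.
LANGUAGE_MODEL = {"the": 0.05, "then": 0.01, "they": 0.01, "cat": 0.002, "cut": 0.001}


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def alternatives(noisy_word, max_edits=2, per_edit_penalty=2.0):
    """Score candidates by log P(clean) + log P(noisy | clean).

    The channel term is an assumed stand-in: a fixed penalty per edit.
    """
    scored = []
    for clean, prior in LANGUAGE_MODEL.items():
        distance = edit_distance(noisy_word, clean)
        if distance <= max_edits:
            scored.append((clean, math.log(prior) - per_edit_penalty * distance))
    if not scored:
        # Unknown word with no close candidate: pass it through unchanged.
        return [(noisy_word, 0.0)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def build_lattice(sentence):
    """Return one list of scored spelling alternatives per whitespace token."""
    return [alternatives(token) for token in sentence.split()]


if __name__ == "__main__":
    for position, options in enumerate(build_lattice("teh cat")):
        print(position, options)
```

In this simplified form the downstream translator would still make the final choice among the alternatives at each position; the sketch only ranks them and leaves all of them available.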

Similar resources

Finite-state transducer inference for a speech-input Portuguese-to-English machine translation system

Statistical techniques and grammatical inference have been used for dealing with automatic speech recognition with success, and can also be used for speech-to-speech machine translation. In this paper, new advances on a method for finite-state transducer inference are presented. This method has been tested experimentally in a speech-input translation task using a recognizer that allows a flexib...

Statistical Machine Translation of European Parliamentary Speeches

In this paper we present the ongoing work at RWTH Aachen University for building a speech-to-speech translation system within the TCStar project. The corpus we work on consists of parliamentary speeches held in the European Plenary Sessions. To our knowledge, this is the first project that focuses on speech-to-speech translation applied to a real-life task. We describe the statistical approach u...

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in the English language are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of the half-space character. This common incorrect use of space leads to some s...

Word Lattices for Multi-Source Translation

Multi-source statistical machine translation is the process of generating a single translation from multiple inputs. Previous work has focused primarily on selecting from potential outputs of separate translation systems, and solely on multi-parallel corpora and test sets. We demonstrate how multi-source translation can be adapted for multiple monolingual inputs. We also examine different appro...

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines, as these are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

Publication date: 2012